One important goal of statistical inference is to learn about the data generating process. In the simplest case, this can mean finding out whether a certain independent variable influences our outcome of interest. If we run a randomized experiment, we can simply compare the outcomes conditional on different values of the independent variable to find out if that variable is part of the data generating process.
Staying with the experiment example, we can also describe two alternative statistical models (e.g., regression models) of the data generating process, where one model assumes an effect of the independent variable and the other does not. Then we can compare the two models and declare the model that fits the data better the winner. The problem with this approach is that if the addition of one independent variable is the only difference between the two models, the model with more variables will always fit the data better, even if it is not the correct model.
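This claim is easy to verify; here is a small illustration (my sketch, not from the text) in which a pure-noise predictor, unrelated to the outcome by construction, still never decreases \(\small R^2\):

```r
# Adding a pure-noise predictor to a linear model never decreases R^2.
set.seed(1)
n = 50
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
noise = rnorm(n)                          # unrelated to y by construction
r2_small = summary(lm(y ~ x))$r.squared
r2_large = summary(lm(y ~ x + noise))$r.squared
r2_large >= r2_small                      # TRUE: the larger model fits at least as well
```

This holds for any nested pair of least-squares models, which is exactly why raw fit cannot be used to compare models of different complexity.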
Hence, we need more sophisticated approaches to determine which of two models captures the data generating process better.
If we were interested only in finding a model that describes the data at hand well, we could simply choose the best-fitting model. Instead, we want to use the data to learn about the process that generated them. After all, only knowledge about this process generalizes to other situations.
The fact that more complex models always fit the data better is related to the term overfitting. An overfitted model does not only describe the data generating process, but also treats part of the data that is just noise as if it belonged to the data generating process.
For instance, assume that we are looking at the relationship between income and well-being, which could have this (simulated) form:
par(mar=c(3,3,0,1), mgp=c(1,.5,0), tck=-.01)
N = 1000
set.seed(123456)
x = sort(rnorm(N,mean = 0))
bf.s = splines::bs(x,knots =3)
bf.b = matrix(c(1,1,1,0),ncol = 1)
mu = bf.s %*% bf.b
y = rnorm(N, mu , sd = .1)
par(mfrow = c(1,1))
plot_xy = function(x,y,mu, idx = NULL, lwd = 1) {
if (is.null(idx)) idx = 1:length(x)
plot(x[idx],y[idx],
ylab = "",
xlab = "",
xlim = quantile(x,c(.01,.99)),
ylim = quantile(y,c(.01,.99)))
lines(x,mu,col = "red", lwd = lwd)
}
plot_xy(x,y,mu)
However, we only have a sample of N = 15 participants to learn about this relationship. The following figure shows different random samples in rows, and predictions of regression models of increasing complexity, from a simple linear model \(\small \mu = \alpha + \beta x\) in column 1 to a 4th-order polynomial \(\small \mu = \alpha + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\) in column 4. The red lines show the true relationship and the blue lines show the learned relationships.
plot_fits = function(x,y,mu,sample_size) {
par(mfrow = c(4,4), mar=c(2,2,.5,.5), mgp=c(1.25,.25,0), tck=-.01)
for (k in 1:4) {
idx = sample(N,sample_size)
dt = data.frame(x = x[idx], y = y[idx])
q.model = alist(
y ~ dnorm(mu,exp(log_sigma)),
mu <- a + b*x,
a ~ dnorm(0.5,1),
b ~ dnorm(0,10),
log_sigma ~ dnorm(0,1)
)
plot_xy(x,y,mu,idx)
plot_quap_preds(q.model,dt,"x", plot = F)
plot_xy(x,y,mu,idx)
q.model[[2]] = alist(mu <- a + b[1]*x + b[2]*x^2)[[1]]
plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
plot_xy(x,y,mu,idx)
q.model[[2]] = alist(mu <- a + b[1]*x + b[2]*x^2 + b[3]*x^3)[[1]]
plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
plot_xy(x,y,mu,idx)
q.model[[2]] = alist(mu <- a + b[1]*x + b[2]*x^2 + b[3]*x^3 + b[4]*x^4)[[1]]
plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
}
}
plot_fits(x,y,mu,sample_size = 15)
This figure shows that while making the model increasingly complex allows us to fit the data ever better (see the \(\small R^2\) values), it does not necessarily allow us to capture the true data generating process better. By looking at model predictions at extreme values for x, we can also see that complex models overfit.
What happens if we increase the sample size to 75?
plot_fits(x,y,mu,sample_size = 75)
The variation in the inferred associations across samples is now smaller, but we still see overfitting, and also underfitting, i.e. the inability of the simplest model to capture the non-linear relationship. This would be more clearly visible with posterior predictive plots, which would show systematic differences between data and model predictions, i.e. overestimation of \(\small y\) at large and small \(\small x\).
Comparing columns 2 and 4 shows overfitting: even though the blue lines are closer to the red line in column 2, the \(\small R^2\) is generally highest in column 4.
The goal of model comparison is then to identify the model that best describes the data generating process, i.e. the one that suffers least from over- or underfitting.
To identify a good model, we need some measure of model accuracy, or of the consistency of model and data.
Imagine you are working with a drug rehabilitation center, trying to evaluate different statistical models that predict whether a patient will relapse within 4 weeks after leaving the center. Your data look as follows:
set.seed(1234)
treatment.weeks = round(rnorm(15,mean = 3, sd = 1),1);
dt = data.frame(
treatment.weeks,
p.relapse = round(inv_logit(-treatment.weeks+1.25),2),
relapse = (round(inv_logit(-treatment.weeks+1.25),2) > runif(15))*1 )
kable(dt[, c("treatment.weeks", "relapse")]) %>%
kable_styling(full_width = F)
| treatment.weeks | relapse |
|---|---|
| 1.8 | 0 |
| 3.3 | 0 |
| 4.1 | 0 |
| 0.7 | 1 |
| 3.4 | 0 |
| 3.5 | 0 |
| 2.4 | 1 |
| 2.5 | 0 |
| 2.4 | 0 |
| 2.1 | 0 |
| 2.5 | 0 |
| 2.0 | 0 |
| 2.2 | 0 |
| 3.1 | 0 |
| 4.0 | 0 |
We can build two prediction models: one that takes the number of treatment weeks into account, and one that simply assumes treatment is good and nobody ever relapses. Here are the data with predictions from these models:
dt$pred.no_relapse = 0
dt$pred.weeks =
glm(dt$relapse~dt$treatment.weeks, family = binomial()) %>%
predict(type = "response") %>%
round(3)
kable(dt[, c("treatment.weeks", "relapse", "pred.no_relapse", "pred.weeks")]) %>%
kable_styling(full_width = F)
| treatment.weeks | relapse | pred.no_relapse | pred.weeks |
|---|---|---|---|
| 1.8 | 0 | 0 | 0.286 |
| 3.3 | 0 | 0 | 0.012 |
| 4.1 | 0 | 0 | 0.002 |
| 0.7 | 1 | 0 | 0.843 |
| 3.4 | 0 | 0 | 0.009 |
| 3.5 | 0 | 0 | 0.007 |
| 2.4 | 1 | 0 | 0.089 |
| 2.5 | 0 | 0 | 0.071 |
| 2.4 | 0 | 0 | 0.089 |
| 2.1 | 0 | 0 | 0.165 |
| 2.5 | 0 | 0 | 0.071 |
| 2.0 | 0 | 0 | 0.200 |
| 2.2 | 0 | 0 | 0.135 |
| 3.1 | 0 | 0 | 0.018 |
| 4.0 | 0 | 0 | 0.002 |
A simple way to measure accuracy would be to compute the average probability of making a correct prediction. Here are these average probabilities for the two models:
av_acc.no_relapse =
(sum(1-dt$pred.no_relapse[dt$relapse == 0]) + # prob. non-relapse
sum(dt$pred.no_relapse[dt$relapse == 1])) / # prob. relapse
15 # divide by N
av_acc.weeks =
(sum(1-dt$pred.weeks[dt$relapse == 0]) +
sum(dt$pred.weeks[dt$relapse == 1]))/15
av_acc.no_relapse
## [1] 0.8666667
av_acc.weeks
## [1] 0.8576667
We can see that, by this metric, the model that ignores both the available information and the true data generating process performs better. That it can favor models inconsistent with the data generating process is an undesirable property of an accuracy criterion.
Recall that in Bayesian updating we used the joint probability of the data given the model, which is different from the average probability we just calculated. One intuition for why the joint probability is useful is that, because it multiplies all probabilities, it does not allow us to ignore part of the data as long as we predict most of the data well. A related viewpoint is that averaging lets us favor models that make impossible predictions, such as a relapse probability of zero for everyone.
Hence, here is the joint probability for our two models:
joint_prob.no_relapse =
prod(1-dt$pred.no_relapse[dt$relapse == 0]) *
prod(dt$pred.no_relapse[dt$relapse == 1])
joint_prob.weeks =
prod(1-dt$pred.weeks[dt$relapse == 0]) *
prod(dt$pred.weeks[dt$relapse == 1])
joint_prob.no_relapse
## [1] 0
joint_prob.weeks
## [1] 0.02314252
Now we see that the model that is consistent with the data generating process has the better “score”. Compare the code for the average accuracy and the joint probability to see that the difference is that the former sums probabilities whereas the latter multiplies them.
Instead of multiplying probabilities, we can also sum log-probabilities:
joint_log_prob.no_relapse =
sum(log(1-dt$pred.no_relapse[dt$relapse == 0])) +
sum(log(dt$pred.no_relapse[dt$relapse == 1]))
joint_log_prob.weeks =
sum(log(1-dt$pred.weeks[dt$relapse == 0])) +
sum(log(dt$pred.weeks[dt$relapse == 1]))
joint_log_prob.no_relapse
## [1] -Inf
joint_log_prob.weeks
## [1] -3.766084
Multiplying probabilities and adding log probabilities are mathematically equivalent. The results differ slightly only due to numerical problems that arise when multiplying small probabilities:
joint_prob.weeks - exp(joint_log_prob.weeks)
## [1] -6.938894e-18
The result from summing log probabilities is more accurate!
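The numerical advantage becomes dramatic with more data. A small illustration (my addition, not from the text): the product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite.

```r
# Underflow demo: 100 independent events, each with probability 1e-4.
p = rep(1e-4, 100)
prod(p)       # underflows to 0 (the true value, 1e-400, is below double precision)
sum(log(p))   # -921.034..., finite and exact up to rounding
```

This is why log probabilities are the standard currency for likelihood computations.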
The concepts of information and uncertainty are tightly related.
Measures of information or uncertainty should have the following properties, which we illustrate below: small changes in the probability of an event should result in small changes in the measure; the measure should increase with the number of possible events; and the measure should be additive for independent events.
Paraphrasing Wikipedia:
Entropy quantifies the [average] amount of uncertainty involved in the outcome of a random process. For example, revealing the outcome of a fair coin flip (with two equally likely outcomes) provides less information (lower entropy) than revealing the outcome from a roll of a die (with six equally likely outcomes).
Entropy is our measure of uncertainty!
To show how to calculate entropy, we start with
\[ H(p) = \mathbb{E}[-log(p_i)] = -\sum_{i=1}^{n} p_i log(p_i) \]
Here, \(\small H(p)\) is the entropy of the distribution \(\small p\) with occurrence probabilities \(p_i\), and \(\small \mathbb{E}\) is the expectation, i.e. the weighted mean, where the weights are the event probabilities.
Why \(\small log(p_i)\)? Because the marginal gain of information decreases: the information gained by learning that an event happens with probability .1 instead of .05 is greater than the information gained by learning that an event happens with probability .75 instead of .7.
curve(log(x), ylab = "log(p)", xlab = "p")
lines(c(.05,.05), log(c(.05,.1)), col = "red", lwd = 2)
lines(c(.7,.7), log(c(.7,.75)), col = "red", lwd = 2)
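The same claim can be checked numerically (my addition): the same absolute change in probability produces a much larger change on the log scale when the probabilities are small.

```r
# The .05 -> .1 step and the .7 -> .75 step on the log scale:
log(.1) - log(.05)   # log(2) ~ 0.69, a large gain
log(.75) - log(.7)   # ~ 0.07, a much smaller gain
```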
In R code we can write:
H = function(p) -sum(p*log(p))
Note that if one \(\small p_i = 1\) the sum of all other \(\small p_i\)s must be zero and entropy / uncertainty becomes zero.
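One practical caveat (my addition): the H function above returns NaN when some \(\small p_i = 0\), because 0 * log(0) evaluates to NaN in R. Using the convention \(\small 0 \cdot log(0) = 0\), a safer variant is:

```r
# Safer entropy sketch: treat 0 * log(0) as 0, per the usual convention.
H_safe = function(p) -sum(ifelse(p > 0, p * log(p), 0))
H_safe(c(1, 0))   # 0: no uncertainty when one outcome is certain
```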
What provides more information?
Here is the entropy of a coin toss:
H(rep(1,2)/2)
## [1] 0.6931472
and here is the entropy for a roll of a die:
H(rep(1,6)/6)
## [1] 1.791759
Small changes in the probability of an event result in small changes in the information measure.
Let's look at the simple case of only two possible events. The bottom panel of the following figure shows the probability of the first event \(\small p_1\) on the x-axis and the entropy on the y-axis:
p1 = seq(.01,.99, length.out = 99)
entropy = do.call(
c,
lapply(p1, function(p1) H(c(p1,1-p1)))
)
layout(matrix(c(1,2,2), ncol = 1))
par(mar=c(0,3,1,1), mgp=c(1.25,.3,0), tck=-.01)
plot(0,type = "n", xlim = c(0,1), ylim = c(0,1),
xlab = "", xaxt = "n",
ylab = expression(p[1]+p[2]))
polygon(x = c(0,1,1,0), y = c(0,0,1,0), col = "blue")
polygon(x = c(0,1,0,0), y = c(0,1,1,0), col = "red")
text(.75,.25, expression(p["i=1"]), cex = 2, col = "white")
text(.25,.75, expression(p["i=2"]), cex = 2, col = "white")
par(mar=c(3,3,1,1), mgp=c(1.25,.3,0), tck=-.01)
plot(p1,entropy, type = 'l', xlab = expression(p[1]))
The figure shows that when we have two events and assign them equal probability, uncertainty/entropy is large, whereas it becomes smaller when we assign one of the two options a higher probability.
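This can also be checked directly; a one-line sketch (my addition), reusing the entropy function defined above:

```r
H = function(p) -sum(p * log(p))   # as defined earlier
H(c(.5, .5)) > H(c(.9, .1))        # TRUE: equal probabilities maximize entropy
```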
Entropy changes continuously with \(\small p_1\), i.e. there are no jumps.
What happens with entropy if we increase the number of events \(\small n\) and calculate entropy for uniform distributions over events?
n = 2:8
entropies =
do.call(
c,
lapply(n, function(n) H(rep(1,n)/n))
)
names(entropies) = n
par(mar=c(3,3,1,1), mgp=c(1.25,.3,0), tck=-.01)
barplot(entropies, ylab = "entropy", xlab = "number possible events with equal probability")
As the number of possible events increases, entropy increases.
A related property is that if we start with probabilities for two events, e.g.
P2 = c(A1 = .5, A2 = .5)
and further split one of the two events
P2s = c(A1 = .5, A2.1 = .25, A2.2 = .25)
then the entropy of the second distribution should be larger:
cbind(H(P2), H(P2s))
## [,1] [,2]
## [1,] 0.6931472 1.039721
If two events A and B are independent, the information gained from observing both A and B (joint probabilities) is the sum of the information gained from A alone and B alone (marginal probabilities).
Let's assume we have the following joint distribution of event types A and B:
P = matrix(c(.125,.375,.125,.375), ncol = 2)
colnames(P) = c("A1","A2")
rownames(P) = c("B1","B2")
P %>%
kable() %>%
kable_styling(full_width = F)
| A1 | A2 | |
|---|---|---|
| B1 | 0.125 | 0.125 |
| B2 | 0.375 | 0.375 |
Given these cells, we can calculate the margins:
P = addmargins(P)
P %>%
kable() %>%
kable_styling(full_width = F)
| A1 | A2 | Sum | |
|---|---|---|---|
| B1 | 0.125 | 0.125 | 0.25 |
| B2 | 0.375 | 0.375 | 0.75 |
| Sum | 0.500 | 0.500 | 1.00 |
And because A and B are independent, we can also calculate the cell entries from the margins. For example:
\[ \small \begin{aligned} P(A=1, B=1) & =P(A=1)P(B=1)&\\ & =.5 \cdot .25&\\ & =.125 \end{aligned} \]
Because we can calculate the table cells from the margins and vice versa, the information/entropy of the cells representation and of the margins representation should be the same, i.e.
\[ \small \begin{aligned} P_{cells} = \{&P(A=1,B=1), \\ &P(A=1,B=2), \\ &P(A=2,B=1), \\ &P(A=2,B=2)\} \\ P_A = \{&P(A=1),\\ &P(A=2)\} \\ P_B = \{&P(B=1),\\ &P(B=2)\} \\ H(P_{cells}) = \;&H(P_A) + H(P_B) \end{aligned} \]
Let's try this out:
H_cells = H(c(.125, .125, .375, .375))
H_margins = H(c(.5,.5)) + H(c(.25,.75))
cbind(H_cells, H_margins)
## H_cells H_margins
## [1,] 1.255482 1.255482
To summarize, you can keep in mind that \[ entropy = uncertainty = -information \]
To understand why entropy is useful to measure the accuracy of a model, one can reframe the problem and ask “Does a model contain the same information as the true data generating process?”.
We can talk about information because we can calculate the probability of the data given the model, both for the true data generating process and for the candidate model (what we create with the sim function from the rethinking package).
Let’s use our initial example again to make clear what we mean when we measure accuracy:
par(mar=c(3,3,2,1), mgp=c(1.5,.7,0), tck=-.01)
layout(matrix(c(1,3,2,3), ncol = 2))
plot(x,y)
lines(x, mu, col = "red")
title("DGP and population data")
set.seed(1)
idx = sample(N,25)
dt = data.frame(x = x[idx], y = y[idx])
q.model = alist(
y ~ dnorm(mu,exp(log_sigma)),
mu <- a + b[1]*x + b[2]*x^2,
a ~ dnorm(0.5,1),
b ~ dnorm(0,10),
log_sigma ~ dnorm(0,1)
)
q.fit = quap(q.model,data = dt, start = list(b = rep(0,2)))
plot_xy(x,y,mu, idx)
plot_quap_preds(q.model,dt,"x", plot = F, start.list = list(b = rep(0,3)))
mtext(1,text = "x", line = 2.5)
mtext(2,text = "y", line = 2.5)
title("DGP, sample & inferred DGP / model")
preds = plot_quap_preds(q.model,dt,"x", plot = F,
start.list = list(b = rep(0,3)),
return.yhat = TRUE)
preds$y = predict(bf.s,newx = preds$x) %*% bf.b
plot(preds$x, preds$yhat,'l', col = "blue", ylab = "", xlab = "", xlim = c(-2.4,2.4))
mtext(1,text = "x", line = 1.5)
mtext(2,text = "y", line = 1.75)
arrows(x0 = preds$x, y0 = preds$y,
y1 = preds$yhat,length = .05, code = 3,
col = adjustcolor("purple",alpha = .5))
lines(preds$x, preds$y,'l', col = "red")
lines(preds$x, preds$yhat,'l', col = "blue")
title("Difference between true DGP and candidate model")
legend("topleft", lty = 1, col = c("red","blue"), legend = c("true DGP", "model"), bty = "n")
We can calculate the probability of data according to the true DGP and the inferred model. In our example, the probability of the data given the true DGP is
\(\small y_i \sim Normal(1\cdot bs_1(x_i) + 1\cdot bs_2(x_i) + 1\cdot bs_3(x_i) + 0\cdot bs_4(x_i), 0.1)\)
or in R code:
mu = bf.s %*% bf.b # using matrix multiplication
As before, we can use p to calculate how much information about the data we have in the DGP; entropy tells us how much information about the observed data we have in our model.
p_DGP = dnorm(y, mu , 0.1)
The probability of the data given the inferred model is (using posterior means)
precis(q.fit, depth = 2)
## mean sd 5.5% 94.5%
## b[1] 0.11919107 0.02319151 0.08212656 0.15625559
## b[2] -0.04835877 0.02104013 -0.08198497 -0.01473258
## a 0.97484959 0.02788155 0.93028948 1.01940970
## log_sigma -2.25773912 0.14667811 -2.49215907 -2.02331917
\(\small y_i \sim Normal(0.97 + 0.12 \cdot x_i - 0.05 \cdot x_i^2, 0.10)\)
or in R code:
mu = 0.97 + 0.12*x - 0.05*x^2
p_model = dnorm(y, mu, exp(-2.25))
If we then want to learn about the accuracy of a candidate model, we can compute how much information about the data it has relative to the true DGP (which is also a model). This difference, or deviance, in information about the data is called the Kullback-Leibler divergence, KL divergence, or \(\small D_{KL}\) for short.
The Kullback-Leibler divergence in R:
dKL = function(p,q) {
sum(p*(log(p)-log(q)))
}
which we can use to calculate the divergence between a distribution p = c(.375,.625) and alternative distributions q.
With probabilities of observed events given the true DGP \(\small p\) and model predicted probabilities \(\small q\) the Kullback-Leibler divergence is defined as: \[ \small \begin{aligned} D_{KL}(p,q) & = \sum_i p_i \,log(p_i) - p_i \, log(q_i) \\ & = \sum_i p_i \, \left(log(p_i) - log(q_i)\right) \\ & = \sum_i p_i \, log \left( \frac{p_i}{q_i} \right) \end{aligned} \]
Here, \(\small -\sum_i p_i \,log(p_i)\) is the entropy and \(\small -\sum_i p_i \,log(q_i)\) is the cross entropy.
Note that if \(\small p = q\) then \(\small log(p_i) - log(q_i) = 0\) and we can say that the candidate model has no additional uncertainty about the data on top of the uncertainty that is already present given the DGP: \(\small D_{KL}(p,q) = 0 \; | \; p = q\). Therefore we can say that the Kullback-Leibler divergence is a measure of additional uncertainty due to using a candidate model instead of the true model.
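A quick sanity check of these properties (my addition), reusing the dKL function defined above:

```r
# KL divergence is zero when the model equals the DGP, positive otherwise.
dKL = function(p, q) sum(p * (log(p) - log(q)))   # as defined earlier
p = c(.375, .625)
dKL(p, p)           # 0: no additional uncertainty when q equals p
dKL(p, c(.5, .5))   # > 0: extra uncertainty from using the wrong distribution
```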
So far, we have determined that the KL divergence quantifies how much additional uncertainty we incur by using a candidate model instead of the true model.
However, we generally do not know the true model, we only have a sample of data that was generated from it. This is not really a problem, because our initial goal was model comparison; we want to compare the deviances of two models, here \(\small q\) and \(\small r\):
\[\small D_{KL}(p,q) - D_{KL}(p,r)\] Here is how the deviances of the two models are defined:
\[ D_{KL}(p,q) = \sum_i p_i log(p_i) - p_i log(q_i) \\ D_{KL}(p,r) = \sum_i p_i log(p_i) - p_i log(r_i) \]
If we want to compute the relative accuracy of the models, we calculate
\[ \textrm{relative accuracy} = \sum_i \left( p_i log(p_i) - p_i log(q_i) \right) - \sum_i \left( p_i log(p_i) - p_i log(r_i) \right) \]
Luckily, we can simplify this because the entropy terms \(\small \sum_i p_i log(p_i)\) are identical in both deviances and cancel, and because the unknown weights \(\small p_i\) are the same in both remaining sums, so it suffices to score each model by the log probability of the observed data. We are then in the delightful situation that we only need to know \(\small \sum_i log(q_i)\) and \(\small \sum_i log(r_i)\) to compare candidate models:
\[ \textrm{relative accuracy} = \sum_i log(q_i) - \sum_i log(r_i) \]
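In R, this comparison is just a difference of summed log probabilities. A sketch with hypothetical pointwise probabilities (the numbers below are made up for illustration):

```r
# Relative accuracy of two candidate models q and r over the same observations.
log_q = log(c(.9, .8, .95))   # model q's probabilities for three observations
log_r = log(c(.7, .6, .99))   # model r's probabilities for the same observations
sum(log_q) - sum(log_r)       # positive, so model q is relatively more accurate here
```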
To see how we can calculate this quantity for a model, remember that we started with the definition of entropy:
\[ H(p) = \mathbb{E}[-log(p_i)] = -\sum_{i=1}^{n} p_i log(p_i) \]
where \(\small p_i\) is the probability of observing some data given the model (and its parameters). Where have we heard this before?
\[ \overset{\color{violet}{\text{posterior probability}}}{P(parameter|data)} = \frac{\overset{\color{red}{\text{likelihood}}}{P(data|parameter)} \cdot \overset{\color{blue}{\text{prior probability}}}{P(parameter)}}{\overset{\color{orange}{\text{evidence}}}{P(data)}} \]
This is just the likelihood term we already use when we do Bayesian updating. One important thing when computing this quantity in a Bayesian context is that the \(\small parameter\) is not a single value, but a posterior distribution.
The rethinking package has a function lppd, which calculates for each data point the log of the mean probability (density) of the observation given the model and the posterior distribution of parameters, also called the log pointwise predictive density.
Coming back to this example:
q.model = alist(
y ~ dnorm(mu,exp(log_sigma)),
mu <- a + b[1]*x + b[2]*x^2,
a ~ dnorm(0.5,1),
b ~ dnorm(0,10),
log_sigma ~ dnorm(0,1)
)
q.fit = quap(q.model, dt, start = list(b = rep(0,2)))
We are calculating for each data point its average probability or density as follows:
ll = sim(q.fit,ll = TRUE,n = 10) # log likelihood, 10 posterior samples
pd = exp(ll) # predictive density
pd[,1:7]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 3.103094 0.7244417 2.1879643 3.170319 2.401484 2.562270 2.608732
## [2,] 4.222944 0.1947807 1.5830725 4.476915 2.601612 2.536477 2.566610
## [3,] 4.224661 0.2382558 0.5407222 4.329622 2.910189 2.236943 1.882277
## [4,] 4.019846 0.4414507 1.5400368 4.052146 2.137097 3.066981 2.989443
## [5,] 3.345985 0.4037935 0.7210154 3.342705 3.108511 1.711312 1.473847
## [6,] 3.577617 0.5842039 1.6940047 3.511569 2.316777 2.763000 2.737932
## [7,] 3.187418 0.7721257 1.0856073 3.540546 2.216649 2.824227 2.488812
## [8,] 3.805839 0.3655112 2.4570885 3.952249 2.520633 2.699598 2.899687
## [9,] 3.808209 0.4281001 0.4423692 4.062624 2.641725 2.462164 1.932263
## [10,] 3.867629 0.4729086 1.0927847 3.912741 2.337205 2.820196 2.600774
We get a matrix with as many columns as cases (showing only 7 of 25 here) and as many rows as posterior samples. Note that we are dealing with densities here, which can be greater than 1.
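As a one-line reminder that densities are not probabilities (my addition, not from the text): a normal density with a small standard deviation easily exceeds 1 at its mode.

```r
# Density of a Normal(0, 0.1) evaluated at its mode:
dnorm(0, mean = 0, sd = 0.1)   # ~3.99, far above 1
```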
ppd = colMeans(pd) # pointwise predictive density
lppd = log(ppd) # log pointwise predictive density
lppd
## [1] 1.3127351 -0.7709851 0.2885316 1.3442069 0.9239367 0.9432507
## [7] 0.8829564 1.0096229 1.2220954 1.1726152 -0.5119216 1.3177163
## [13] 0.7820240 0.3318089 1.3297479 1.3251596 -0.2128348 1.3661615
## [19] 1.3549807 0.7472119 0.7405731 1.0796010 1.1725731 1.1302970
## [25] 1.3308160
In the following figure, each blue line is a prediction from the posterior distribution, and vertical lines represent the distance of observed data from predictions, where larger distances correspond to smaller probabilities of the data given the model.
par(mar=c(3,3,0.5,0.5), mgp=c(1.25,.5,0), tck=-.01)
plot_xy(x,y,mu, idx, lwd = 2.5)
mtext(1,text = "x", line = 2)
mtext(2,text = "y", line = 2)
set.seed(123)
mu.train = link(q.fit, data = data.frame(x = c(x[idx],seq(-2.5,2.5,length.out = 100))), n = 50)
matlines(seq(-2.5,2.5,length.out = 100),y = t(mu.train[,-(1:25)]), lty = 1,
col = adjustcolor("blue", alpha = .25))
os = seq(-.02,.02, length.out = 50)
for (k in 1:50)
arrows(x[idx] + os[k], y0 = y[idx], y1 = mu.train[k,1:25], length = 0,
col = adjustcolor("black", alpha = .1))
legend("topleft", bty = "n",
pch = c(1,NA,NA),
lwd = c(1,2.5,1),
lty = c(0,1,1),
col = c("black","red","blue"),
legend = c("data","DGP mu","posterior mu samples"),)
Given a quap model, the log pointwise predictive density (lppd) is simply calculated as:
lppd(q.fit)
## [1] 1.2058139 -0.4395136 0.3245750 1.2433400 0.8734743 0.9510972
## [7] 0.8767728 0.9876721 1.0368104 1.1452246 -0.4270184 1.2775434
## [13] 0.7828610 0.5720692 1.2258546 1.2611329 -0.1524585 1.3136410
## [19] 1.2645250 0.6905799 0.8296157 1.0465825 1.1396387 1.1462925
## [25] 1.3012294
The lppd is our stand-in for the probability values from the basic definition of entropy when we want to calculate (relative) divergences.
So far, we have calculated the divergence for the data we observed. The problem is that this still measures model fit to the observed data. What we actually want to compare is how consistent two different models are with the data generating process, or with new data it could produce.
One pragmatic way to do this is to evaluate model fit not on the data we used to fit the model parameters (training data) but on a new set of test data that was not used for estimation.
The following figure shows, in addition to the 25 data points used to estimate the model parameters, 25 new test data points that were not used for estimation. We want to calculate the deviance of the model predictions from these test data.
To visualize that we are using the posterior distribution to calculate deviance, the figure also shows 50 lines (mu), one for each posterior sample. The figure does not show variation in the standard deviation.
par(mar=c(3,3,0.5,0.5), mgp=c(1.25,.5,0), tck=-.01)
plot_xy(x,y,mu, idx, lwd = 2.5)
mtext(1,text = "x", line = 2)
mtext(2,text = "y", line = 2)
set.seed(123)
test.idx = sample(setdiff(1:length(x),idx),25)
mu.test = link(q.fit, data = data.frame(x = c(x[test.idx],seq(-2.5,2.5,length.out = 100))), n = 50)
matlines(seq(-2.5,2.5,length.out = 100),y = t(mu.test[,-(1:25)]), lty = 1,
col = adjustcolor("blue", alpha = .25))
os = seq(-.02,.02, length.out = 50)
for (k in 1:50)
arrows(x[test.idx] + os[k], y0 = y[test.idx], y1 = mu.test[k,1:25], length = 0, col = adjustcolor("purple", alpha = .2))
points(x[test.idx],y[test.idx], col = "red", pch = 16)
legend("topleft", bty = "n",
pch = c(1,16,NA,NA),
lwd = c(1,0,2.5,1),
lty = c(0,0,1,1),
col = c("black","red","red","blue"),
legend = c("training data","test data","DGP mu","posterior mu samples"),)
We can use the following figure from the book to explain why deviance in the test data set (cross validation) is a better approach to model comparison:
Model comparison (done correctly) helps us choose a model that provides a good representation of the true DGP by penalizing models that overfit. This penalization is achieved mainly by assessing fit not on the training data set, but on a held-out test data set.
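The idea can be sketched in a few lines of base R (my addition; the data and variable names are made up for illustration): fit a model on a training half and score the log predictive density on a held-out half, analogous to the lppd computed above.

```r
# Minimal train/test split: fit on one half, score on the other.
set.seed(42)
n = 100
d = data.frame(x = rnorm(n))
d$y = sin(d$x) + rnorm(n, sd = 0.3)
train = 1:50; test = 51:100
fit = lm(y ~ poly(x, 4), data = d, subset = train)
pred = predict(fit, newdata = d[test, ])
sigma_hat = summary(fit)$sigma
test_lpd = sum(dnorm(d$y[test], pred, sigma_hat, log = TRUE))   # higher is better
```

Comparing test_lpd across candidate models of different complexity penalizes overfitting, because noise fitted in the training half does not recur in the test half.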
A complementary approach to counter overfitting is to specify priors that shrink model coefficients towards zero. Such shrinkage priors are typically normally distributed with a mean of zero and a relatively small standard deviation, where “relatively” refers to the scale on which a predictor is measured.
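In the model notation used above, a shrinkage prior might look as follows (my sketch; it assumes the predictor x is standardized, so an SD of 0.25 is small relative to x's scale):

```r
# Model definition with a shrinkage prior on the slope (alist is base R).
shrink.model = alist(
  y ~ dnorm(mu, exp(log_sigma)),
  mu <- a + b * x,
  a ~ dnorm(0.5, 1),
  b ~ dnorm(0, 0.25),    # small SD relative to x's scale pulls b toward zero
  log_sigma ~ dnorm(0, 1)
)
```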
To show how shrinkage works, we estimate spline models with different standard deviations on regression coefficients for the simulated income and well-being data above.
The following figure shows the estimated relationships for different samples drawn from the population.